1 Introduction

The aim of this work is to identify direct links between the injury risk of professional soccer players and certain measurable factors.

Specifically, the injuries in focus are muscle strains in the lower body. To grasp the seriousness of this problem, consider the charts below:

A large number of matches are missed by players due to injuries - a costly experience for teams, and an unpleasant one for both players and soccer fans.

A wide collection of player-level information was used to explore the relationship between various variables and injury risk, using visual and statistical modeling methods.

While it was not possible to establish an accurate prediction system for individual injuries, the analysis demonstrates that other valuable insights can be gathered. Among the most important is the impact of past injuries on future risk.

There are believed to be two roadblocks to high-accuracy predictions:
1. Injuries are rare events, which makes them hard to model from a technical perspective
2. Any dataset of historical player injuries is inherently biased, as coaches and teams already apply their own mitigation methods

The last section contains suggestions for future work, among them an interesting concept of investigating injuries in multiple-day time windows, rather than treating them as in-game occurrences.

2 Scope

Data used for this exercise is a per-game, per-player collection of Premier League matches between 2003 and 2018.
This data is publicly available, though not in an organized format, hence it cannot be shared without the explicit permission of the data owners.

The target variable for prediction is “injured” - 1 if the player was reported injured after the game, 0 otherwise. Descriptive data is also available on the type of the injury, and how long it reportedly lasted.

The list of other variables can be found in the Appendix.

3 Data Preparation

3.1 Filters

As the original dataset was in raw format (as collected from the internet), certain filters needed to be applied before the data was ready for meaningful modeling work.

3.1.1 Years

Years before 2010 were removed, as their injury data was incomplete.
Year 2018 was also removed - the Premier League season was still in progress at the time of data collection.
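Using the same conventions as the modeling code later in this document (the original_sample data frame and the Year field are assumptions carried over from there), this filter can be sketched as:

cleaned_sample <- original_sample %>%
                    filter(Year >= 2010,
                           Year <  2018)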

3.1.2 Missing data

In certain cases, fields contained missing values (NAs) for some records.
Given the limited number of rows affected, these were removed from the dataset rather than having the NAs imputed with other values.
Fields affected:
* Foot
* Games (or minutes) played
* Height & Weight
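A minimal sketch of this row removal, assuming the field names listed in the Appendix (Foot, minutes, Height, Weight):

cleaned_sample <- cleaned_sample %>%
                    filter(!is.na(Foot),
                           !is.na(minutes),
                           !is.na(Height),
                           !is.na(Weight))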

3.1.3 Injury types & length

Unfortunately, soccer injuries come in many different forms, and their seriousness can vary widely.

The aim of this project is not to predict all possible injury types. The goal is to help decision makers (coaching staff) monitor the players’ physical condition, and manage their playing time if necessary to avoid typical soccer injuries.

Any successful prediction exercise needs to make sure that the event being predicted has a statistical connection to the predictors. For many injury types, their occurrence is unrelated to outside conditions (at least to our current knowledge). Examples include knocks, concussions, etc. This analysis is not interested in these.

Rather, the injuries we are after are the “wear and tear” type - breakdowns of the human body related to increased physical stress, either in intensity or length. (Unfortunately, most of the available data describes length - e.g. games played - rather than intensity - e.g. maximum speed.)
Based on their mechanical relationship and similar length profiles, injuries to be in focus are:
1. Hamstring
2. Groin Strain
3. Calf Muscle Strain
4. Thigh Muscle Strain

Even comparing the same injury types can sometimes be “apples to oranges”, as their severity might differ significantly. The question “How to measure the severity of an injury?” can be answered in many ways.

In this analysis, the descriptive data available on injuries is their reported length. As seen above, wide ranges are covered. Should we treat a hamstring strain lasting 1 day the same as one lasting 2 months?

How this question is handled is an important part of the analysis. One could certainly try to weight injuries by their length. This would be a good input into a cost-benefit analysis of resting players for “high-risk” matches.
Note, however, that to avoid introducing unnecessary complexity, injuries lasting less than 2 weeks will not be considered an injury for the purposes of this analysis.
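A sketch of how this threshold could be applied, reusing the injured and injury_length fields described in the Scope section (the exact recoding used in the project may differ):

cleaned_sample <- cleaned_sample %>%
                    mutate(injured = ifelse(injured == 1 & injury_length < 14,
                                            0,
                                            injured))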

This does remove a large part of the “injured” data (see below - the removed partition is marked in orange), but it helps create a more homogeneous categorization.

3.1.4 Position

Goalkeepers were removed from the population, as their behaviour on the pitch is radically different from that of other positions - as we see, their injury rates are significantly lower as well. The focus going forward is on field players.

Records with missing position were also removed - upon structural checks, the quality of these rows was not deemed trustworthy enough for further investigation.

3.1.5 Playing time

No filters (other than removing NA values) were applied to the playing time fields.
While it is a reasonable thought that minimal playing time (either for some bench players, or in early season records) might not contribute to injuries, the data suggests otherwise.
The above chart supports this - records with less than 180 season minutes have no lower injury rates than others.

3.2 Feature Engineering

There are certain variables that were not available in the raw data format; however, they could be calculated or extracted, with the aim of enhancing the analytical separation between injured and non-injured cases.

Below is a comprehensive list of all variables created during the data preparation:
* Weekday and Month
* BMI
* Team and Opponent
* Birthplace and Nationality
* Kick-off time (time of the day)
* Injury “history” (indicator whether player was injured in a previous time window)
* Games played in a given time window (career / last year / last 90 days)
* Some variables were grouped into larger categories (e.g. Venue, Team)
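To illustrate, a few of these derivations could be sketched as below. The column names (Date, Height, Weight, pid) follow the Appendix, but the exact code used in the project may differ (e.g. Height is assumed here to be in centimeters):

features <- cleaned_sample %>%
              mutate(weekday = weekdays(as.Date(Date)),
                     month   = months(as.Date(Date)),
                     bmi     = Weight / (Height / 100)^2) %>%
              group_by(pid) %>%
              arrange(Date) %>%
              mutate(injured_before = (cumsum(injured) - injured) > 0) %>%
              ungroup()

The injured_before flag counts injuries in a player’s earlier records only (the current row is subtracted from the cumulative sum), which is one simple way of encoding injury “history”.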

An important technical note on categorical variables: before feeding them into predictive models, all categoricals were dummy-encoded.
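With caret, this can be done via the dummyVars() helper - a sketch, assuming the prepared data_train set used in the modeling section (the outcome column is excluded from the encoded output):

dummies <- dummyVars(injured ~ ., data = data_train)
X_train <- predict(dummies, newdata = data_train) %>% as.data.frame()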

3.3 Exploratory Analysis

3.3.1 Intro, explain chart types

In this section, injury rates (number of injuries / number of records) will be broken down by categories of different variables.

For each variable, there are 2 charts (from left to right):
1. Error bar encoding the 95% confidence interval of injury rates in a given category. The point marks the mean, and point size encodes the number of games in the category.
2. Average injury length in a given category, broken out by injury type
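For reference, the error bars can be computed with a normal-approximation confidence interval - a sketch, where category stands for whichever variable is being broken down:

injury_rates <- cleaned_sample %>%
                  group_by(category) %>%
                  summarise(n     = n(),
                            rate  = mean(injured),
                            se    = sqrt(rate * (1 - rate) / n),
                            lower = rate - 1.96 * se,
                            upper = rate + 1.96 * se)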

Commentary will only be provided on the variables deemed most important by the machine learning models applied.

3.3.2 By physical attributes - age, height, weight, BMI

Age has a more or less positive linear connection to the likelihood of an injury - the older the player, the more likely the injury.
It has to be noted however that the impact is not very large, especially considering that the two “outlier” groups (youngest/oldest) do not contain a large number of players.

Even though BMI is considered a significant predictor in some models, visual analysis does not uncover any clear trends.
This might be due to a non-linear relationship - tree-based methods put more weight on BMI than linear ones do.

3.3.3 By Kick-off time

3.3.4 By year, month & weekday

There was a clear trend in injury frequencies over the years: first an increase leading up to 2015, followed by drops in the consecutive years since then.

3.3.5 By nationality & birthplace

Nationality and birthplace were selected by multiple models. Most notably, players from Africa are prone to getting injured somewhat more frequently.

3.3.6 By team, opponent and venue

Team and venue both have considerable importance according to the models. This might be a sign of the impact of playing styles (e.g. speed, physicality). However, these variables are just “distant” proxies for style of play, hence no theories should be made about causality.

3.3.7 By games & minutes played

Aggregate data shows an inverse relationship between playing load before a match and injury frequencies.

This is counterintuitive at first - the expectation is that the more playing time, the more load on a player’s body, leading to a higher risk of injury.

There might be a selection bias, however - players who “survive” without being injured will be available to play the most time during a season. This work will only note this discrepancy, and not provide commentary on causality in the absence of further information.

3.3.8 By injury history

Injury history shows clear signs of defining future injury risk. Players who have been injured before not only tend to be more likely to get re-injured, but recent injuries also weigh more heavily on a player. Players with more recent injuries, if injured again, tend to miss more time as well.

4 Modeling

4.1 Methodology

4.1.1 Avoiding overfitting

To avoid overfitting, a standard approach of separating training / validation / performance sets was used.

Year 2017 is set aside for final performance evaluation:

training_set    <- original_sample %>%
                     filter(Year < 2017)

performance_set <- original_sample %>%
                     filter(Year == 2017)

Data from other years is split into a training and test set, based on a 70% - 30% split:

training_ratio <- 0.70

train_indices <- createDataPartition(y = training_set[["injured"]],
                                     times = 1,
                                     p = training_ratio,
                                     list = FALSE)

data_train <- training_set[train_indices,  ] %>% as.data.frame()
data_test  <- training_set[-train_indices, ] %>% as.data.frame()

Note: cross-validation was used on the training set for model tuning.

4.2 The problem of class imbalance

Based on the previous definition of “injured”, the cleaned dataset consists of only 1.02% positive cases. This makes it highly imbalanced, something that can hurt model performance in classification tasks.

More precisely: “doing classification when classes are highly imbalanced leads to underestimation of conditional probabilities of the minority class” (Delgado 2017).

Specifically for the injury prediction problem: the cost of misclassifying a case-of-interest (injured == YES) observation is higher than the cost of the reverse error. It is very important to know when a player is at risk of getting injured, as if the event occurs, he can be out for a long period of time, costing the team both money and the resources needed to compete. On the other hand, if a player is in good shape, an extra rest day can only hurt so much.

In machine learning applications there are multiple ways of addressing this issue (C.V. KrishnaVeni 2011). The main ones include, but are not limited to:
1. Sub-sampling
2. Cost sensitive learning - weighting observations
3. Learning methods, like one-class classifiers (e.g. autoencoders)

Let’s review the first one in more detail, as it was used for this analysis.

4.2.1 Sub-sampling techniques for learning from imbalanced data

There are three possible ways to sub-sample data for better predictive performance: up-sampling, down-sampling, and methods mixing the two, like SMOTE. Each carries its own benefits and disadvantages, as per (Anonymus 2017).

  1. Down-sampling the majority class: In this case we are removing observations from the majority class to make the data more balanced. This method is usually best when the available data size is large. It can reduce the burden of computation and storage; on the flip side, it could potentially remove useful information.

  2. Up-sampling the minority class: Oversampling repeats examples of the minority class, usually via bootstrapping. There is no information loss as with down-sampling; however, computation times can increase. Also, by repeating the same observations, this method can lead to overfitting.

  3. SMOTE (Synthetic Minority Over-sampling Technique): An alternative approach is “mixing” the above two techniques. Down-sampling has an obvious drawback in the loss of information. Up-sampling risks identifying more similar, but also more specific, regions of the feature space, hence not generalizing well in certain cases. SMOTE (Chawla et al. 2002) creates synthetic new examples from the minority class, rather than just repeating existing observations. The method is actually inspired by an older image recognition technique, where the same pictures are slightly distorted (e.g. via rotation). The idea behind SMOTE works well in theory - however, depending on the data, it might introduce new problems. One such problem is “noisy” data: when cases of the classes are not well separated, SMOTE can enlarge this overlap, as it does not take the distribution into consideration when creating synthetic new examples. For similar reasons, SMOTE tends not to work well for high-dimensional data.

Another important aspect of sub-sampling is when to apply it in the modeling workflow (Kuhn 2018). If sub-sampling is done before model tuning (rather than inside the resampling process), it introduces two problems:
1. For model tuning, the held-out sets in cross-validation will also be sub-sampled. This can create unrealistic or overly optimistic performance measures.
2. Added uncertainty: we won’t know for sure how the results would look with a different subsample. Similar effects arise as in the point above.

4.2.2 A practical comparison of different sub-sampling methods

Below is a summary of different sub-sampling techniques applied to the injury prediction problem. For this analysis, similar Random Forest models were tuned with the different sampling techniques.

Note: Technical details of the model fitting process are covered in a later section. For now, just note that the “caret” R library (Jed Wing et al. 2018) was used to create all models. Caret provides a nice and consistent user interface, which makes it easier to comprehend the code referenced in this document. Random Forests were fit with the “ranger” library (Wright and Ziegler 2017), while SMOTE sampling uses the “DMwR” library (Torgo 2010).

4.2.2.1 AUC and ROC curves

First, let’s refer to the all-in-one metric of classification problems, area under the curve:

Model           AUC
No Sampling     0.6103
Up-sampling     0.6104
Down-sampling   0.6045
SMOTE           0.6159

As seen above, the AUC values do not create a clear separation between the different sampling techniques. SMOTE does somewhat better than the others, but the differences are not significant.

This calls for a more detailed inspection, starting with the ROC curves:

The ROC curves are also very similar. SMOTE may perform somewhat better at low and high specificity rates, while the others seem better in the middle. Now let’s consider something: the real optimization problem is maximizing the time a player can spend on the soccer field. The costliest case is when an injury event goes undetected - false negatives can create a lot more headache than false positives. Or so it seems at first - but the possible number of false positives is a lot higher than that of false negatives. This problem will be explored further in the cost-benefit analysis section.

Neither sampling method is particularly strong, which foreshadows the issues of using the models for decision making on player resting.
This is not the only use-case of analytics, however: ranking players by injury risk, or understanding the driving forces behind injury probabilities, can also provide value to coaches, as well as to front offices evaluating different players.

The first aspect - how realistic the predicted probabilities are on average - will be reviewed here. The impact of the different variables is covered when discussing the modeling results.

4.2.2.2 Probability calibrations

Any sub-sampling method will modify the a-priori probabilities of the training data, hence the predicted probabilities of the classifier will be biased.

The below chart demonstrates this problem in practice:

Many methods exist for the calibration of sample-biased probabilities. Practical and simple, formula-based solutions can be found in (Deutsch 2010) and (Pozzolo et al. 2015). Other methods include regressing the target variable on the predictions, but this is not a feasible approach if the original dataset is highly imbalanced.

This analysis implemented the rescaling below, where original_fraction is the ratio of the minority class in the original sample, oversampled_fraction is the biased predicted ratio, and score is the individual prediction for a given instance.

f_recalibrate <- function(original_fraction, oversampled_fraction, score) {
  per_p <- 1 + (1 / original_fraction - 1) / (1 / oversampled_fraction - 1) * (1 / score - 1)
  p <- 1 / per_p
  
  return(p)
}
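As a quick sanity check of the formula (the 50% oversampled fraction below is an illustrative assumption; 1.02% is the minority ratio reported earlier): a raw score of 0.5 from a model trained on perfectly balanced classes maps back to the original base rate, because the correction divides out the change in prior odds.

f_recalibrate(original_fraction    = 0.0102,
              oversampled_fraction = 0.5,
              score                = 0.5)
# returns 0.0102 - the original base rate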

Applying this correction results in the new probabilities seen below. As demonstrated, these probabilities are better in line with the actual occurrence ratios.

Again, no model does particularly well. SMOTE is strongest, however, on a key issue: on average, it separates cases with different risk levels much better than any other method. No sampling and up-sampling do not allow for any confident ranking, while down-sampling creates more biased estimates for higher-risk instances.

This fact, together with the theoretical advantages and the computational effort aspect, makes SMOTE the best choice for conducting the further analysis.

4.3 Methods used

4.3.1 Overview

Each method was called with the same caret control function:

ctrl <- trainControl(method = "cv", 
                     classProbs = TRUE, 
                     summaryFunction = twoClassSummary, 
                     sampling = "SMOTE")

Models were tuned via 10-fold cross-validation. Other parameters instruct caret that this is a binary classification problem, and that SMOTE-sampling should be used during the fitting process.

Seeding was applied before each fitting call, to facilitate reproducibility of the results.

set.seed(93)

A total of 5 different methods were fit and tuned to select the best one for predicting injuries.
Below is a review of them one-by-one to understand the differences.

4.3.2 Logit and Probit (GLM)

Logit and Probit are both linear classification methods. They differ in the link function which maps the result to the 0 - 1 space (so that it can be used as a probability), which also defines the interpretation of the coefficients.

These methods work best when it is important that model details can be understood. As a downside, however, they are not able to capture non-linear effects and complex relationships (without further feature engineering), and the lack of regularization could drive them to overfit if the data is high-dimensional.

Logit code:

train(injured ~ .,
      data      = data_train,
      method    = "glm",
      family    = binomial(link = "logit"),
      metric    = "ROC",
      trControl = ctrl)

Probit code:

train(injured ~ .,
      data      = data_train,
      method    = "glm",
      family    = binomial(link = "probit"),
      metric    = "ROC",
      trControl = ctrl)

4.3.3 Regularized Logit (GLMNet)

GLMNet (Simon et al. 2011) uses an elastic-net penalty which balances the use of lasso and ridge regularization.
The regularization shrinks the coefficients of unimportant variables towards zero, hence it can be used as a variable selector as well. The tuned regularization ensures that the model does not overfit the training data, but GLMNet still does not automatically address non-linear components.

GLMNet code:

tg_glmnet <- expand.grid(alpha  = c(1:10 / 100),
                         lambda = c(1:10 / 100))

train(injured ~ .,
      data       = data_train,
      method     = "glmnet",
      family     = "binomial",
      metric     = "ROC",
      trControl  = ctrl,
      preProcess = c("center", "scale"),
      tuneGrid   = tg_glmnet)

The tuneGrid parameter instructs caret to complete a grid search of provided alpha and lambda parameters. Standardization is applied, as necessary for regularization.

4.3.4 Random Forest (Ranger)

Random Forest is a tree-based method which requires minimal tuning in exchange for strong performance characteristics. The trees ensure that non-linear patterns are also taken into account for prediction. Random Forest ensembles many decision trees, each built on a bootstrapped sample of the training data. At each split, only a random selection of features is considered for the best split decision, which reduces the likelihood of relying too much on a small set of variables. This makes random forests quite resistant to overfitting.

Random Forest code:

tg_rf <- expand.grid(.mtry          = c(2:12),
                     .splitrule     = "gini",
                     .min.node.size = c(1:8 * 25))

train(injured ~ .,
      data       = data_train,
      method     = "ranger",
      metric     = "ROC",
      trControl  = ctrl,
      tuneGrid   = tg_rf,
      num.trees  = 1000,
      importance = "impurity")

1,000 trees were fit to make predictions. Tuning parameters included the number of random features selected at each split, the splitting rule, as well as the minimum node size in each tree.

4.3.5 Gradient Boosting (XGBoost)

XGBoost (T. Chen et al. 2018) is a widely-used implementation of extreme gradient boosting. The workings and parameters of XGBoost are well covered in its documentation.

While gradient boosting tends to perform better than random forests on many problems, it is complicated to fit well, due to the large number of tuning parameters. Training XGBoost, while reasonably fast, is also significantly slower than training Random Forests (with ranger), for the same reason.

XGBoost code:

tg_xgb <- expand.grid(nrounds          = c(250, 500),
                      max_depth        = c(5, 10, 15),
                      eta              = c(0.1, 0.25),
                      gamma            = c(0.1, 0.25),
                      colsample_bytree = c(2:6 / 10),
                      min_child_weight = c(1:4),
                      subsample        = c(5:9 / 10))

train(injured ~ .,
      method    = "xgbTree",
      metric    = "ROC",
      data      = data_train,
      trControl = ctrl,
      tuneGrid  = tg_xgb)

4.4 Results

4.4.1 Predictions

As before, the evaluation will start with reviewing the AUC values and ROC curves of the models being compared.

4.4.1.1 Evaluation based on AUC and ROC curves

Model           AUC
Logit           0.5635
Probit          0.5629
GLMNet          0.5752
Random Forest   0.6159
XGBoost         0.5896

Random Forest has a significant, roughly 0.026 lead over the second-best model, XGBoost.

This is due to its ability to better classify positives at lower cut-off thresholds, as demonstrated by the ROC curve. XGBoost performs similarly in the upper range of the FPR scale.

The linear models are somewhat lagging behind, especially without regularization. This is a good sign that injury probability does not always have a linear relationship with the predictor variables.

Unfortunately, no model has a very strong performance. This indicates that the classifications cannot be a primary decision-making input. The consequences, and possible mitigations, will be covered in the sections “Cost-benefit analysis” and “A time window concept for injuries”, respectively.

4.4.1.2 Probability calibrations

Even if the predicted probabilities are not accurate enough on their own, they do have value if their ability to rank players from low to high risk is good, at least on average.

The two models that produce relatively well-scaled predictions are GLMNet and Random Forest. Surprisingly, XGBoost does not handle the imbalance well - it correctly assumes that most cases fall into the very-low probability group, however it is not able to predict on scale for higher-risk players.

Based on the above, Random Forest is selected as the best model for the prediction of injuries. An evaluation of its practical value in player resting is covered in the next section.

4.4.2 Cost-benefit analysis

To evaluate the “business” value of the proposed model, consider the following evaluation framework:
1. The benchmark is the actual missed time - the sum of injury lengths
2. Model-based missed time is calculated as follows: if a player is classified as injury-risk, he will not play in the match. A missed length of 3.5 days is assumed. If a player is not classified as injury-risk, but gets injured, the actual injury length is counted

Benchmark data is from year 2017, which was not used for model tuning and evaluation previously.

The below chart demonstrates the impact of choosing different thresholds:

Unfortunately, it seems that even in the best case, the benefits are negligible. And if a slightly different threshold is chosen, the modeled approach is worse off than the historical benchmark.

To understand why, consider the confusion matrix visualized at a reasonable threshold rate:

             Reference
Prediction    Yes     No
Yes            15    366
No             86   9154

The true positive rate - the rate of identified injuries - is 14.85%. On the other hand, false positive classifications happen at a below-4% rate. At first, this seems like a good alternative to the no-model case.

The problem is that false positives are not costless. Positive classifications mean that the player needs to miss the upcoming match (in the analysis measured as 3.5 days missed). The average missed time for an injury is approximately 37 days. With these costs, the model would need to identify around one tenth as many true positives as false positives to be on par with the benchmark.
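This break-even ratio can be reproduced from the assumptions above (3.5 days per rested player, ~37 days average injury length, and the counts from the confusion matrix):

fp_cost   <- 3.5        # days lost per false positive (unnecessary rest)
tp_saving <- 37 - 3.5   # days saved per true positive (injury avoided, rest still costs 3.5 days)

fp_cost / tp_saving     # break-even TP/FP ratio: ~0.10, i.e. around one tenth
15 / 366                # actual TP/FP ratio at this threshold: ~0.04, well below break-even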

This is a case of “no free lunch” in data science - mistakes are costly, hence just because something can be modeled, it does not mean that it can be modeled beneficially for the business. The likely challenge (apart from a possibly large amount of randomness in injuries) is that teams already try to optimize against injuries - public data might not be enough to beat the best soccer managers in the world.

4.4.3 Variable importance

Caret allows for an easy extraction of relative variable importances for any model type. This makes it possible to compare the relative importances across models, even if their mechanics are different. The first observation is that different models weight variables very differently. This is likely due to differences in their ability to grasp non-linearities, as well as to the high correlation among variables.

Focusing on the similarities across all model types:
1. The importance of past injuries
2. The importance of age
3. The impact of the amount played recently, measured by many variables
4. The differences between positions

These results are aligned with the results of the visual exploration.

Random Forest and XGBoost do not allow for an easy interpretation of the direction of the relationship between injuries and a variable; however, the GLMNet coefficients can be considered for some insight:

##TODO

4.4.4 Considerations regarding external validity

Even if the model predictions cannot be considered accurate enough to be direct inputs to decision making on player resting, the relationships uncovered between different factors and injury risk are an important insight on their own.

It needs to be considered, however, whether the results generalize well. There are two important aspects that should be weighed: whether the results contain any bias not modeled in the data, and whether the results would project well into space not covered by the data.

4.4.4.1 Model Bias

Model bias arises when predictions are skewed by certain factors in ways not transparent to the model owners.
Usually, certain dormant (not measured) variables have a causal relationship with the target being analyzed. As the real causes are not measured, proxies are applied, which tend to show high correlation with the underlying causes.

If the desired variables are not available, caution is needed to ensure that the model is not polluted by unintended (and untrue) consequences.

For the injury prediction problem, the performance across years is to be considered.
It is understood that injury rates across years are not constant, but year itself cannot be a predictor, which creates a modeling challenge.

In some sense, unfortunately, the models do fail this test - their predictions are relatively flat over the years. It cannot be concluded, however, that this is a problem with the models - occurrences of injuries certainly have a random factor, and it was shown that the 95% confidence intervals of the actual injury rates are rather wide. What we know is that the modeled variables did not show significant variation over the years, but injury frequencies did. This might be just due to random chance.

4.4.4.2 Data-external factors

It should be remembered that data is for the Premier League, seasons between 2010 and 2017.
Each major soccer league has its own playing styles and player composition at any given time. Some play at a more uptempo pace, with more physicality. In others, the focus is rather on finesse. Players can be very different as well - while the Premier League, being among the richest and most storied leagues, has players from all over the world, it is still predominantly an English league. This is true for its players and personnel as well.

Another aspect is weather. British weather is famous for not being friendly, but compared to more sunny, humid or warm locations the injury risk might be different as well.

These factors are just a question of data, and can be controlled for.

A harder aspect to measure is each player’s individual physical (and psychological) condition. While there are more and more possibilities to gather data on these aspects, it is unlikely that it will be made accessible to the public any time soon.

5 Considerations for future work

A clear challenge in injury prediction with a public dataset is that there is no clear indication whether injuries are truly random to the extent of defying accurate prediction, or whether the role of coaches (and their interventions to prevent injuries of players they see at risk) inherently biases the data. The analysis showed no clear indication that there would be significant differences among teams.

As is generally the case, better data could help enhance the analysis. Specifically, more variables indicating player condition, as well as more descriptive data on past play (distances run, sprints, etc.), could be used to better separate the workload of players before each game.

5.1 A time window concept for injuries

An interesting idea was presented in a recent conference paper (Hisham Talukder 2016) on in-game injuries happening in the NBA.

The authors present a time window technique: rather than trying to predict if a player will get injured during a game, they use a multiple-day prediction window for injuries.

Theoretically, the same idea can be applied to soccer injuries. While this project does not provide a deep investigation into the workability of the concept, a brief look is provided into what it could offer.

In total, 5 Random Forest models were trained (similarly to before). The comparison of their ROC curves can be seen below:

From the prediction perspective, it looks like the ideal time window is somewhere in the 14 - 21 day range. It should be noted, however, that even with an improvement in the TPR / FPR trade-off, a cost-efficient method could not be implemented, as the ratio of false positives is still too high.

Different time windows can be calibrated similarly - while none is perfect, each one is reasonable in terms of risk ranking:

Last but not least, all windows value the importance of the different factors (variables) similarly, even though not identically. This confirms the role these factors have in injury risk, but the differences also encourage further feature inspection.

6 Appendix

6.1 Contact details

The author of this paper can be contacted directly via t.koncz@gmail.com.

6.2 Project organization

This project is broken into separate, runnable parts, which are not included in this markdown file. The main parts are:
1. Raw Data
2. Runnable R scripts, storing the steps of the analysis
3. Functions
4. Models
5. Notebooks
6. Figures
7. Data

Further publicly disclosable details, and specific code, are available in the GitHub repository of this project.

6.3 List of variables in raw data

| Variable | Description |
|----------|-------------|
| Date | date of the game (YYYY-MM-DD) |
| balance | absolute full-time goal difference: \|ft_a - ft_h\| |
| ft_a | full-time away team goals |
| ft_h | full-time home team goals |
| Attendance | number of fans at the game |
| Game week | game week of the season |
| Kick-off | kickoff time (GMT) |
| Venue | location of the stadium |
| home | whether the player lined up for the home side or not |
| ht_a | half-time away team goals |
| ht_h | half-time home team goals |
| injured | whether the player was reported injured after the match (before playing any other match) or not |
| mid | unique match id |
| minutes | minutes played by the player in the match (stoppage time is discarded, so cannot exceed 90) |
| pid | unique player id |
| starter | whether the player started the match or not |
| injury_length | if the player was reported injured after the game, how many days the injury reportedly lasted |
| injury_type | what type of injury was reported |
| injury_minutes_played | equals the minutes variable when injured = 1, otherwise 0 |
| days_till_injury | number of days between the date of the match and the start of the reported injury |
| pl/all_minutes/games_n/season | how many games/minutes of Premier League/any other football the player played in the n days before the match (or in the season, in total) |
| Country of birth | player country of birth |
| Date of birth | player date of birth |
| First name | player first name |
| Foot | player preferred foot |
| Height | player height |
| Last name | player last name |
| Nationality | player nationality |
| Place of birth | player place of birth |
| Position | player position |
| Weight | player weight (at time of data collection) |
| away_team | away team name |
| away_tid | away team id |
| home_team | home team name |
| home_tid | home team id |

6.4 Sessioninfo

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ranger_0.10.1    xgboost_0.71.2   glmnet_2.0-16    foreach_1.4.4   
##  [5] Matrix_1.2-14    DMwR_0.4.1       caret_6.0-80     lattice_0.20-35 
##  [9] forcats_0.3.0    stringr_1.3.1    dplyr_0.7.6      purrr_0.2.5     
## [13] readr_1.1.1      tidyr_0.8.1      tibble_1.4.2     ggplot2_3.0.0   
## [17] tidyverse_1.2.1  pander_0.6.2     kableExtra_0.9.0 knitr_1.20      
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137        bitops_1.0-6        xts_0.11-0         
##  [4] lubridate_1.7.4     dimRed_0.1.0        httr_1.3.1         
##  [7] rprojroot_1.3-2     tools_3.5.1         backports_1.1.2    
## [10] R6_2.2.2            KernSmooth_2.23-15  rpart_4.1-13       
## [13] lazyeval_0.2.1      colorspace_1.3-2    nnet_7.3-12        
## [16] withr_2.1.2         tidyselect_0.2.4    curl_3.2           
## [19] compiler_3.5.1      cli_1.0.0           rvest_0.3.2        
## [22] xml2_1.2.0          caTools_1.17.1.1    scales_0.5.0       
## [25] sfsmisc_1.1-2       DEoptimR_1.0-8      robustbase_0.93-1.1
## [28] digest_0.6.15       rmarkdown_1.10      pkgconfig_2.0.1    
## [31] htmltools_0.3.6     highr_0.7           TTR_0.23-3         
## [34] rlang_0.2.1         readxl_1.1.0        ddalpha_1.3.4      
## [37] rstudioapi_0.7      quantmod_0.4-13     bindr_0.1.1        
## [40] zoo_1.8-3           jsonlite_1.5        gtools_3.8.1       
## [43] ModelMetrics_1.1.0  magrittr_1.5        Rcpp_0.12.17       
## [46] munsell_0.5.0       abind_1.4-5         stringi_1.1.7      
## [49] yaml_2.1.19         MASS_7.3-50         gplots_3.0.1       
## [52] plyr_1.8.4          recipes_0.1.3       gdata_2.18.0       
## [55] pls_2.6-0           crayon_1.3.4        haven_1.1.2        
## [58] splines_3.5.1       hms_0.4.2           pillar_1.3.0       
## [61] reshape2_1.4.3      codetools_0.2-15    stats4_3.5.1       
## [64] CVST_0.2-2          magic_1.5-8         glue_1.3.0         
## [67] evaluate_0.11       data.table_1.11.4   modelr_0.1.2       
## [70] cellranger_1.1.0    gtable_0.2.0        kernlab_0.9-26     
## [73] assertthat_0.2.0    DRR_0.0.3           gower_0.1.2        
## [76] prodlim_2018.04.18  broom_0.5.0         class_7.3-14       
## [79] survival_2.42-3     viridisLite_0.3.0   geometry_0.3-6     
## [82] timeDate_3043.102   RcppRoll_0.3.0      iterators_1.0.10   
## [85] bindrcpp_0.2.2      lava_1.6.2          ROCR_1.0-7         
## [88] ipred_0.9-6

References

Anonymous. 2017. “How to Handle Imbalanced Classification Problems in Machine Learning?” https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/.

KrishnaVeni, C.V., and T. Sobha Rani. 2011. “On the Classification of Imbalanced Datasets.” International Journal of Computer Science & Technology 2 (1): 145–48. http://www.ijcst.com/icaccbie11/sp1/krishnaveni.pdf.

Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. “SMOTE: Synthetic Minority over-Sampling Technique.” J. Artif. Int. Res. 16 (1). USA: AI Access Foundation: 321–57. http://dl.acm.org/citation.cfm?id=1622407.1622416.

Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al. 2018. Xgboost: Extreme Gradient Boosting. https://CRAN.R-project.org/package=xgboost.

Delgado, Daniel Bestard. 2017. “Dealing with Highly Imbalanced Classes in Classification Algorithms.” https://medium.com/bluekiri/dealing-with-highly-imbalanced-classes-7e36330250bc.

Deutsch, Guido. 2010. “Overrepresentation - ‘Sas’-Oversampling.” http://www.data-mining-blog.com/tips-and-tutorials/overrepresentation-oversampling/.

Talukder, Hisham, and Thomas Vincent. 2016. “Preventing In-Game Injuries for NBA Players.” MIT Sloan Sports Analytics Conference. http://www.sloansportsconference.com/wp-content/uploads/2016/02/1590-Preventing-in-game-injuries-for-NBA-players.pdf.

Kuhn, Max, with contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, et al. 2018. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.

Kuhn, Max. 2018. “Subsampling for Class Imbalances.” https://topepo.github.io/caret/subsampling-for-class-imbalances.html.

Pozzolo, Andrea Dal, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi. 2015. “Calibrating Probability with Undersampling for Unbalanced Classification.” In IEEE Symposium Series on Computational Intelligence, SSCI 2015, Cape Town, South Africa, December 7-10, 2015, 159–66. doi:10.1109/SSCI.2015.33.

Simon, Noah, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5): 1–13. http://www.jstatsoft.org/v39/i05/.

Torgo, L. 2010. Data Mining with R, Learning with Case Studies. Chapman; Hall/CRC. http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR.

Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. doi:10.18637/jss.v077.i01.